Project: Star Hotels

By: Syeda Ambreen Karim Bokhari

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans, scheduling conflicts, and so on. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is undesirable and potentially revenue-diminishing for hotels. Losses are particularly high for last-minute cancellations.

New technologies, particularly online booking channels, have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a machine-learning-based solution that can predict which bookings are likely to be cancelled. Star Hotels Group operates a chain of hotels in Portugal. They are facing a high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors strongly influence booking cancellations, build a predictive model that can identify in advance which bookings will be cancelled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Observations

Checking for missing values

Finding and removing duplicates
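A minimal pandas sketch of this step; the frame below is a hypothetical stand-in for the booking data, which the notebook loads from the Star Hotels CSV.

```python
import pandas as pd

# Hypothetical frame with one exact duplicate row; the notebook runs the
# same checks on the full booking dataset
df = pd.DataFrame({
    "lead_time": [224, 224, 5],
    "avg_price_per_room": [65.0, 65.0, 106.7],
})

print("duplicate rows:", df.duplicated().sum())    # rows identical to an earlier row
df = df.drop_duplicates().reset_index(drop=True)   # keep the first occurrence
print("rows after removal:", len(df))
```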

Converting object variables to category
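A hedged sketch of the dtype conversion, using a hypothetical slice of the booking columns; the category dtype reduces memory use and marks the columns as categorical for later encoding.

```python
import pandas as pd

# Hypothetical slice of the booking data; the real frame has more columns
df = pd.DataFrame({
    "type_of_meal_plan": ["Meal Plan 1", "Meal Plan 2", "Not Selected"],
    "room_type_reserved": ["Room_Type 1", "Room_Type 1", "Room_Type 4"],
    "lead_time": [224, 5, 13],
})

# Convert every object (string) column to the memory-efficient category dtype
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype("category")
print(df.dtypes)
```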

Statistical Summary of dataset

Observations:

Observations on Categorical variables:

Exploratory Data Analysis

Univariate Exploratory Data Analysis

Observations on Categorical Variables

Bivariate Exploratory Data Analysis

Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Q1. What are the busiest months in the hotel?

Observations:

Q2. Which market segment do most of the guests come from?

Q3. What are the differences in room prices in different market segments?

Q4. What percentage of bookings are canceled?

Q5. What percentage of repeating guests cancel?

Q6. Do these special requirements affect booking cancellation?

Let us check which of these differences are statistically significant.

The Chi-Square test is a statistical method to determine whether two categorical variables have a significant association.

Null Hypothesis: there is no association between the two variables.
Alternate Hypothesis: there is an association between the two variables.
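The test itself can be run with `scipy.stats.chi2_contingency`. The contingency table below is a hypothetical stand-in for the `pd.crosstab` of special requests against booking status that the notebook builds from the real data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table (special requests vs. booking status)
table = pd.DataFrame(
    {"Not_Canceled": [8500, 9100], "Canceled": [7100, 3500]},
    index=["0 requests", "1+ requests"],
)

chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # reject the null hypothesis of no association below this p-value
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}, reject H0: {p_value < alpha}")
```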

Observations:

If outliers are removed,

Identify Correlation in data

Observations:

Data Preprocessing

Observations:

Exploratory Data Analysis Summary:

Dataset Overview Observations

Descriptive Data Summary:

Observations on Categorical variables:

Questions:

  1. What are the busiest months in the hotel?
    • August is the busiest month, with 5312 entries.
    • July is the second busiest, with 4725 entries.
    • January is the least busy month.
  2. Which market segment do most of the guests come from?
    • Most of the guests (80.3%) come from the online market segment.
    • Since the online segment dominates, most cancellations also come from online bookings.
    • The Complementary segment has no cancellations.
  3. What are the differences in room prices in different market segments?
    • The Online segment has the highest average room price (~120), possibly because the majority of the data comes from this segment.
    • Aviation: ~103
    • Complementary: ~2.8
    • Corporate: ~82.5
    • Offline: ~87.7
  4. What percentage of bookings are canceled?
    • 34% of bookings are cancelled.
    • 66% of bookings are not cancelled.
  5. What percentage of repeating guests cancel?
    • Only 10 repeating guests in the whole dataset cancelled their booking (a negligible percentage).
    • 34% of non-repeating guests cancelled their booking.
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
    • 20.6% of guests who cancelled their booking had no special request.
    • 10.2% of guests who cancelled their booking had one special request.
    • 24.6% of guests who did not cancel their booking had no special request.
    • 26.4% of guests who had one special request did not cancel their booking.

Univariate Analysis

lead_time, avg_price_per_room, no_of_special_requests, no_of_week_nights, no_of_previous_cancellations, and no_of_previous_bookings_not_canceled all have right-skewed distributions, indicating the presence of outliers.

Observations:

If outliers are removed,

Feature Engineering:

Combining date, month, and year after removing rows where the month is February and the date is greater than 28

Dropping the rows with arrival_date = 29, as February 2018 had only 28 days

Combining arrival_date, arrival_month, and arrival_year into a single column
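The two steps above can be sketched as follows, on a hypothetical slice of the arrival columns:

```python
import pandas as pd

# Hypothetical slice of the arrival columns from the booking data
df = pd.DataFrame({
    "arrival_year":  [2018, 2018, 2017],
    "arrival_month": [2, 2, 10],
    "arrival_date":  [28, 29, 2],
})

# February 2018 had only 28 days, so 29-Feb rows are invalid and dropped
bad = (df["arrival_month"] == 2) & (df["arrival_date"] > 28)
df = df[~bad].copy()

# Assemble year, month, and day into a single datetime column
df["arrival"] = pd.to_datetime(
    df[["arrival_year", "arrival_month", "arrival_date"]].rename(
        columns={"arrival_year": "year", "arrival_month": "month", "arrival_date": "day"}
    )
)
print(df["arrival"].dt.date.tolist())
```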

Treating outliers
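One common treatment is IQR capping (winsorising); whether the notebook caps or drops outliers, the mechanics look like this sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical values with one extreme outlier; the notebook treats columns
# such as lead_time and avg_price_per_room
s = pd.Series([10, 12, 11, 13, 12, 300])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey whiskers
s_capped = s.clip(lower, upper)                # cap instead of dropping rows
print(s_capped.tolist())
```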

Encoding booking_status
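Assuming the target takes the two labels Canceled / Not_Canceled (the label strings are an assumption here; the real ones come from the dataset), the encoding can be sketched as:

```python
import pandas as pd

# Hypothetical target values
df = pd.DataFrame({"booking_status": ["Canceled", "Not_Canceled", "Canceled"]})

# Encode the target: 1 = cancelled, 0 = not cancelled
df["booking_status"] = df["booking_status"].map({"Canceled": 1, "Not_Canceled": 0})
print(df["booking_status"].tolist())
```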

Binning dates into months of the year

EDA

Building a Logistic Regression model

Data Preparation

Checking Multicollinearity

These variables: 'no_of_children', 'repeated_guest', 'required_car_parking_space', 'no_of_previous_cancellations', and 'no_of_previous_bookings_not_canceled' result in NaN when VIF is applied, so I'll remove them.

Logistic Regression (with Sklearn library)

Checking performance on training set

Checking performance on testing set

Logistic Regression (with statsmodels library)

Model evaluation criterion

Model can make wrong predictions as:

Which case is more important?

Observations

Now no feature has a p-value greater than 0.05, so we'll consider the features in X_train3 as the final ones and lg2 as the final model.

Coefficient interpretations

Converting coefficients to odds

The coefficients of the logistic regression model are in terms of log(odds); to find the odds we take the exponential of the coefficients. Therefore, odds = exp(b). The percentage change in odds is given by (exp(b) - 1) * 100.
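Applied to hypothetical coefficient values (the real ones come from the fitted model's params attribute):

```python
import numpy as np

# Hypothetical coefficients; real values come from the fitted statsmodels logit (lg2)
coefs = {"lead_time": 0.016, "no_of_special_requests": -1.47}

for name, b in coefs.items():
    odds = np.exp(b)                    # odds ratio for a one-unit increase
    pct_change = (np.exp(b) - 1) * 100  # percentage change in odds
    print(f"{name}: odds = {odds:.3f}, change in odds = {pct_change:+.1f}%")
```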

Coefficient interpretations

Interpretation for other attributes can be done similarly.

Checking model performance on the training set

The confusion matrix

True Positives (TP): we correctly predicted a cancellation and the guest actually cancelled: 6402, or 21.50%.

True Negatives (TN): we correctly predicted no cancellation and the guest did not cancel: 17262, or 57.97%.

False Positives (FP): we incorrectly predicted a cancellation but the guest did not cancel (a "Type I error", falsely predicting positive): 2364, or 7.94%.

False Negatives (FN): we incorrectly predicted no cancellation but the guest actually cancelled (a "Type II error", falsely predicting negative): 3750, or 12.59%.

ROC - AUC

ROC-AUC on training set

The Logistic Regression model has acceptable recall and ROC-AUC scores.

Model Performance Improvement

Let's see if the recall score can be improved further, by changing the model threshold using AUC-ROC Curve.

Optimal threshold using AUC-ROC curve
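One common choice of optimal threshold is the point on the ROC curve maximising TPR − FPR (Youden's J statistic), sketched here on synthetic data with roughly the 66/34 class balance seen in the bookings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic stand-in with roughly the 66/34 class balance of the bookings
X, y = make_classification(n_samples=1000, weights=[0.66, 0.34], random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)[:, 1]

# The ROC curve gives (FPR, TPR) for every candidate threshold;
# Youden's J picks the threshold maximising TPR - FPR
fpr, tpr, thresholds = roc_curve(y, probs)
optimal_threshold = float(thresholds[np.argmax(tpr - fpr)])
print(round(optimal_threshold, 3))
```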

Checking model performance on training set

Checking model performance on training set

Model Performance Summary

Let's check the performance on the test set

Using model with default threshold

Using model with threshold=0.323

Using model with threshold=0.40

The confusion matrix

True Positives (TP): we correctly predicted a cancellation and the guest actually cancelled: 3170, or 24.84%.

True Negatives (TN): we correctly predicted no cancellation and the guest did not cancel: 6952, or 54.47%.

False Positives (FP): we incorrectly predicted a cancellation but the guest did not cancel (a "Type I error", falsely predicting positive): 1483, or 11.62%.

False Negatives (FN): we incorrectly predicted no cancellation but the guest actually cancelled (a "Type II error", falsely predicting negative): 1258, or 9.07%.

Model Performance Summary

Observations:

Final Model Summary

We'll consider the features in X_train3 as the final ones, lg2 as the final model, and 0.323 as the final threshold.

Using model with threshold=0.323

The confusion matrix

True Positives (TP): we correctly predicted a cancellation and the guest actually cancelled: 3462, or 27.13%.

True Negatives (TN): we correctly predicted no cancellation and the guest did not cancel: 6249, or 50.53%.

False Positives (FP): we incorrectly predicted a cancellation but the guest did not cancel (a "Type I error", falsely predicting positive): 1986, or 15.56%.

False Negatives (FN): we incorrectly predicted no cancellation but the guest actually cancelled (a "Type II error", falsely predicting negative): 866, or 6.79%.

Model performance evaluation

Conclusion

Recommendations

Building a Decision Tree model

Decision Tree model

We will build our model using the DecisionTreeClassifier function, with the default 'gini' criterion for splits.
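A sketch of this step on synthetic stand-in data; an unconstrained tree typically fits the training set perfectly, which foreshadows the overfitting noted below:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded booking features and target
X, y = make_classification(n_samples=2000, weights=[0.66, 0.34], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Default criterion='gini'; with no depth limit the tree grows until leaves are pure
tree = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X_train, y_train)

train_recall = recall_score(y_train, tree.predict(X_train))
test_recall = recall_score(y_test, tree.predict(X_test))
print(f"train recall: {train_recall:.3f}, test recall: {test_recall:.3f}")
```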

Split Data

Checking model performance on training set

There is a huge disparity between the model's performance on the training set and the test set, which suggests that the model is overfitting.

Visualizing the Decision Tree

Reducing over fitting

Using GridSearch for Hyperparameter tuning of our tree model

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
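The procedure described above can be sketched as follows (synthetic stand-in data; the notebook uses the real X_train):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split
X_train, y_train = make_classification(n_samples=500, random_state=1)

# Effective alphas from minimal cost-complexity pruning
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# One pruned tree per effective alpha; the last alpha prunes to a single root node
clfs = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
        for a in ccp_alphas]

# Drop the trivial single-node tree before comparing sizes
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
nodes = [c.tree_.node_count for c in clfs]
print(nodes[0], "->", nodes[-1])  # node count shrinks as alpha grows
```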

The maximum recall is at alpha = 0.025, but choosing it would leave the decision tree with only a root node and we would lose the business rules. Instead, we can choose alpha = 0.004, retaining information while still getting a high recall.

Checking performance on training set

Checking performance on testing set

Visualizing the Decision Tree

Creating model with 0.004 ccp_alpha

Checking performance on the training set

Checking performance on the testing set

Visualizing the Decision Tree

Do we need to prune the tree?

Model Performance Comparison and Conclusions

Actionable Insights and Recommendations

Conclusions

Recommendations